Predicting Role Relevance with Minimal Domain Expertise in a Financial Domain
Word embeddings have made enormous inroads in recent years in a wide variety
of text mining applications. In this paper, we explore a word embedding-based
architecture for predicting the relevance of a role between two financial
entities within the context of natural language sentences. In this extended
abstract, we propose a pooled approach that uses a collection of sentences to
train word embeddings using the skip-gram word2vec architecture. We use the
word embeddings to obtain context vectors that are assigned one or more labels
based on manual annotations. We train a machine learning classifier using the
labeled context vectors, and use the trained classifier to predict contextual
role relevance on test data. Our approach serves as a good minimal-expertise
baseline for the task as it is simple and intuitive, uses open-source modules,
requires little feature crafting effort, and performs well across roles.
Comment: DSMM 2017 workshop at ACM SIGMOD conference
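The pipeline described above (skip-gram word embeddings, mean-pooled context vectors, a supervised classifier) can be sketched roughly as follows. This is a toy illustration, not the paper's implementation: the random vectors stand in for trained word2vec embeddings, the sentences and labels are invented, and a minimal gradient-descent logistic regression stands in for whatever classifier the authors used.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for skip-gram word2vec embeddings (the paper trains these
# with word2vec over a sentence collection; random vectors are used here
# purely for illustration).
vocab = ["acme", "acquires", "globex", "from", "initech", "sells", "to"]
emb = {w: rng.normal(size=8) for w in vocab}

def context_vector(sentence):
    """Mean-pool word embeddings into a single context vector."""
    vecs = [emb[w] for w in sentence.lower().split() if w in emb]
    return np.mean(vecs, axis=0)

# Hypothetical annotated sentences: 1 = role is relevant, 0 = not.
data = [
    ("Acme acquires Globex", 1),
    ("Globex sells to Initech", 1),
    ("Acme from Initech", 0),
]
X = np.stack([context_vector(s) for s, _ in data])
y = np.array([lbl for _, lbl in data], dtype=float)

# Minimal logistic-regression classifier trained by gradient descent.
w, b = np.zeros(X.shape[1]), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    grad = p - y
    w -= 0.1 * X.T @ grad / len(y)
    b -= 0.1 * grad.mean()

pred = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(int)
print(pred.tolist())
```

At test time the same pooling is applied to unseen sentences and the trained classifier predicts contextual role relevance.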
Named Entity Resolution in Personal Knowledge Graphs
Entity Resolution (ER) is the problem of determining when two entities refer
to the same underlying entity. The problem has been studied for over 50 years,
and most recently, has taken on new importance in an era of large,
heterogeneous 'knowledge graphs' published on the Web and used in domains
as wide-ranging as social media, e-commerce and search. This chapter
will discuss the specific problem of named ER in the context of personal
knowledge graphs (PKGs). We begin with a formal definition of the problem, and
the components necessary for doing high-quality and efficient ER. We also
discuss some challenges that are expected to arise for Web-scale data. Next, we
provide a brief literature review, with a special focus on how existing
techniques can potentially apply to PKGs. We conclude the chapter by covering
some applications, as well as promising directions for future research.
Comment: To appear as a book chapter by the same name in an upcoming (Oct.
2023) book `Personal Knowledge Graphs (PKGs): Methodology, tools and
applications' edited by Tiwari et al.
Using Contexts and Constraints for Improved Geotagging of Human Trafficking Webpages
Extracting geographical tags from webpages is a well-motivated application in
many domains. In illicit domains with unusual language models, like human
trafficking, extracting geotags with both high precision and recall is a
challenging problem. In this paper, we describe a geotag extraction framework
in which context, constraints and the openly available Geonames knowledge base
work in tandem in an Integer Linear Programming (ILP) model to achieve good
performance. In preliminary empirical investigations, the framework improves
precision by 28.57% and F-measure by 36.9% on a difficult human trafficking
geotagging task compared to a machine learning-based baseline. The method is
already being integrated into an existing knowledge base construction system
widely used by US law enforcement agencies to combat human trafficking.Comment: 6 pages, GeoRich 2017 workshop at ACM SIGMOD conferenc
Understanding Prior Bias and Choice Paralysis in Transformer-based Language Representation Models through Four Experimental Probes
Recent work on transformer-based neural networks has led to impressive
advances on multiple-choice natural language understanding (NLU) problems, such
as Question Answering (QA) and abductive reasoning. Despite these advances,
there is still limited work on understanding whether these models respond to
perturbed multiple-choice instances in a sufficiently robust manner that would
allow them to be trusted in real-world situations. We present four confusion
probes, inspired by similar phenomena first identified in the behavioral
science community, to test for problems such as prior bias and choice
paralysis. Experimentally, we probe a widely used transformer-based
multiple-choice NLU system using four established benchmark datasets. Here we
show that the model exhibits significant prior bias and, to a lesser but
still highly significant degree, choice paralysis, in addition to other
problems. Our
results suggest that stronger testing protocols and additional benchmarks may
be necessary before such language models are deployed in user-facing systems
or in decision making with real-world consequences.
Understanding Substructures in Commonsense Relations in ConceptNet
Acquiring commonsense knowledge and reasoning is an important goal in modern
NLP research. Despite much progress, there is still a lack of understanding
(especially at scale) of the nature of commonsense knowledge itself. A
potential source of structured commonsense knowledge that could be used to
derive insights is ConceptNet. In particular, ConceptNet contains several
coarse-grained relations, including HasContext, FormOf and SymbolOf, which can
prove invaluable in understanding broad, but critically important, commonsense
notions such as 'context'. In this article, we present a methodology based on
unsupervised knowledge graph representation learning and clustering to reveal
and study substructures in three heavily used commonsense relations in
ConceptNet. Our results show that, despite having an 'official' definition in
ConceptNet, many of these commonsense relations exhibit considerable
substructure. In the future, therefore, such relations could be subdivided
into other relations with more refined definitions. We also supplement our core
study with visualizations and qualitative analyses.
Comment: arXiv admin note: substantial text overlap with arXiv:2011.1408
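The representation-learning-then-clustering methodology can be illustrated in miniature: embed the triples of one coarse relation, cluster the embeddings, and inspect whether distinct sub-groups emerge. The vectors below are synthetic stand-ins with two planted sub-groups (a real study would learn them with unsupervised KG representation learning over ConceptNet), and a plain Lloyd's-algorithm k-means stands in for whatever clustering the authors used.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-ins for learned embeddings of triples under one coarse
# relation (e.g. HasContext): two planted sub-groups of 20 vectors each.
group_a = rng.normal(loc=0.0, scale=0.3, size=(20, 16))
group_b = rng.normal(loc=3.0, scale=0.3, size=(20, 16))
X = np.vstack([group_a, group_b])

def kmeans(X, k, iters=50):
    """Plain Lloyd's algorithm with deterministic, evenly spread seeds."""
    centroids = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(iters):
        # Distance of every point to every centroid, then nearest-centroid
        # assignment and a mean update per cluster.
        d = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return labels

labels = kmeans(X, k=2)
print(labels[:20].tolist(), labels[20:].tolist())
```

When the recovered clusters line up with coherent semantic sub-groups of a relation, that is evidence of the kind of substructure the article reports: a single 'official' relation covering several finer-grained usages.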